Journal of Cheminformatics
Springer Science and Business Media LLC
Preprints posted in the last 90 days, ranked by how well they match Journal of Cheminformatics's content profile, based on 25 papers previously published here. The average preprint has a 0.02% match score for this journal, so anything above that is already an above-average fit.
Abbott, J. M.
Machine learning models for protein-ligand bioactivity prediction are increasingly used in computational drug discovery. However, reported benchmark performance is often sensitive to evaluation design. To clarify how evaluation design affects reported results, we present a systematic evaluation of seven machine learning architectures for kinase inhibitor bioactivity prediction, spanning classical baselines (Random Forest, XGBoost, ElasticNet, multi-layer perceptron) and advanced neural approaches (Graph Isomorphism Network, ESM-2 protein embedding MLP, and a GNN-ESM fusion model). Using a curated ChEMBL-derived kinase activity dataset of 352,874 records across 507 human protein kinase targets, we evaluated all models under three splitting strategies of increasing stringency: random, scaffold-based (Bemis-Murcko), and target-held-out. We observed that Random Forest with Morgan fingerprints achieved near-equivalent or superior performance to all neural architectures under scaffold and target-based evaluation. On target-held-out splits, frozen ESM-2 embeddings showed worse generalization, with ESM-FP MLP exhibiting the largest performance degradation. Learned graph representations (GIN) did not outperform fixed 2048-bit ECFP4 fingerprints at this data scale, and tree-based uncertainty methods outperformed the MC-Dropout implementations tested here on calibration and selective prediction metrics. A JAK kinase subfamily case study shows that protein-aware models achieved 79% top-1 selectivity accuracy versus 52% for pooled fingerprint models. However, stronger baselines using explicit target identity achieved 83-84%, indicating that ESM-2 embeddings in this study functioned primarily as an implicit target identifier. These results indicate that evaluation methodology and statistical rigor are major determinants of reported performance in bioactivity prediction.
Figure 1. Benchmark design overview. A curated ChEMBL kinase bioactivity dataset (352,874 records, 507 targets) was evaluated under three splitting strategies of increasing stringency. Seven model architectures spanning baselines, protein-aware, and graph neural approaches were each trained under 5-seed replication (105 total runs), with results analyzed across three complementary branches: the main 507-target benchmark, ESM-2 embedding ablation studies on a clean 92-target subset, and a JAK-family selectivity case study with stronger target-conditioned baselines.
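Both the scaffold-based and the target-held-out splits mentioned in this abstract can be read as group-wise splits: all records sharing a group key land on the same side. The following is a minimal sketch of that idea, not the authors' code. It assumes each record already carries a precomputed group key (for example a Bemis-Murcko scaffold string or a target identifier); the function name `group_holdout_split`, the record fields, and the example values are hypothetical.

```python
from collections import defaultdict


def group_holdout_split(records, key, test_fraction=0.2):
    """Split records so that no group (scaffold or target) spans train and test."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key]].append(rec)

    # Deterministic order: largest groups first, ties broken by group key.
    ordered = sorted(groups.items(), key=lambda kv: (-len(kv[1]), str(kv[0])))

    # Number of records we would like to keep for training.
    n_train_target = len(records) - round(test_fraction * len(records))

    train, test = [], []
    for _, members in ordered:
        # A group goes to train only if it fits entirely; otherwise to test.
        if len(train) + len(members) <= n_train_target:
            train.extend(members)
        else:
            test.extend(members)
    return train, test


# Illustrative toy data: 10 records over 4 made-up scaffold labels.
records = (
    [{"id": i, "scaffold": "A"} for i in range(4)]
    + [{"id": i, "scaffold": "B"} for i in range(4, 7)]
    + [{"id": i, "scaffold": "C"} for i in range(7, 9)]
    + [{"id": 9, "scaffold": "D"}]
)
train, test = group_holdout_split(records, key="scaffold", test_fraction=0.2)
train_groups = {r["scaffold"] for r in train}
test_groups = {r["scaffold"] for r in test}
```

Passing `key="target"` instead of `key="scaffold"` would give a target-held-out split under the same assumptions. A random split has no such grouping constraint, which is one reason it tends to be the least stringent of the three.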
Ferreyra, S.; Dutra, I.; Galeano, A.; Paccanaro, A.
Drug-target affinity (DTA) prediction is a key task in drug discovery, enabling the estimation of the interaction strength between candidate compounds and biological targets. However, current models rely on connectivity-based molecular representations and do not explicitly account for spatial organization, also known as stereochemistry. This limitation becomes evident when considering chirality, where a drug can exist as enantiomers, i.e., molecules that share the same atoms and bonds but differ in their three-dimensional arrangement. Despite their chemical similarity, enantiomers can interact differently with the same target, leading to variations in binding affinity and biological activity. In this paper, we propose a stereochemistry-aware DTA prediction framework that incorporates this information into molecular representations. Drug representations are learned from chemical structure using a directed-bond message passing graph neural network that captures enantiomer configurations, while protein targets are represented through sequence-based embeddings. Experiments on the Davis dataset demonstrate that our model can improve affinity prediction. Importantly, a case study on a manually curated dataset of enantiomers with different biological action shows that the model is able to distinguish the affinities of the two forms, consistent with their experimentally observed biological activity. These findings support the relevance of stereochemistry-aware molecular representation for more accurate and chemically faithful DTA prediction.
Fieux-Castagnet, A.; Waton, J.; Glukhonemykh, A.; Snow, E.; Ashokkumar, R.; Fleming, J.; Champagne, D.; Devenyns, T.; Peluffo, A.; Anagnostopoulos, C.
Protein structure prediction models (such as AlphaFold, Chai, Boltz) have transformed structural biology and are increasingly explored for drug discovery; however, their utility for large-scale screening of antibody-antigen (AB-AG) interactions remains unclear, particularly for distinguishing true binding from non-binding pairs at scale. To our knowledge, there has not been an exhaustive exploration of Boltz-2 inference settings on this high-impact problem, and in this paper we set out to describe and implement a novel benchmarking framework that can accelerate progress in the field. We evaluated Boltz-2 (NVIDIA NIM implementation) on 519 therapeutic monoclonal antibodies from Thera-SAbDab, pairing each antibody with its cognate target and a randomly assigned non-cognate antigen. We developed a novel evaluation framework that systematically captures variability across stochastic seeds while benchmarking different inference settings, including datasets with and without crystallographically resolved antibody structures. Across settings, Boltz-2-derived confidence metrics showed weak, though above-chance, discrimination (0.5 < ROC-AUC < 0.60). Among evaluated metrics, the minimum value of the interface predicted TM-score (ipTM-min) across seed-samples captured the strongest signal. Interestingly, additional feature aggregation and multivariate modelling provided little to no improvement. Increasing the number of stochastic predictions yielded front-loaded gains, with diminishing returns beyond ~15-20 seed-samples, suggesting limited value of extensive sampling in practical workflows. Notably, inference without multiple sequence alignments (MSAs) slightly improved performance on non-crystallized antibodies (ΔAUROC ≈ +0.027) while reducing runtime by ~8 seconds per prediction compared to shallow MSA settings.
Overall, these results indicate that off-the-shelf confidence metrics from general-purpose structure prediction models may be insufficient for reliable target-antibody screening and highlight the need for task-specific optimization, while confirming that modest amounts of sampling can be helpful but are not in themselves sufficient to improve performance significantly, as gains plateau relatively quickly.
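To make the ipTM-min metric and the ROC-AUC figures above concrete, here is a minimal sketch of how such a number could be computed: take the minimum ipTM over the seed-samples of each antibody-antigen pair, then measure how often a cognate pair outscores a non-cognate pair. This is an illustration under assumptions, not the authors' pipeline; the function names, antibody labels, and score values are all made up.

```python
def iptm_min(seed_scores):
    """Aggregate per-seed ipTM values for one pair by taking the minimum."""
    return min(seed_scores)


def roc_auc(scores_pos, scores_neg):
    """Rank-based ROC-AUC: probability that a positive outscores a negative.

    Ties count as half a win, which matches the usual Mann-Whitney definition.
    """
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Hypothetical ipTM values over three seed-samples per pair.
cognate = {"ab1": [0.62, 0.55, 0.70], "ab2": [0.40, 0.45, 0.38]}
non_cognate = {"ab1": [0.50, 0.35, 0.48], "ab2": [0.41, 0.52, 0.44]}

pos = [iptm_min(v) for v in cognate.values()]      # [0.55, 0.38]
neg = [iptm_min(v) for v in non_cognate.values()]  # [0.35, 0.41]
auc = roc_auc(pos, neg)                            # 3 of 4 comparisons won
```

An AUC of 0.5 corresponds to chance, so the reported range of 0.5 to 0.60 means a cognate pair outranks a non-cognate one only slightly more often than a coin flip would.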
Liu, T.; Jiang, S.; Zhang, F.; Sun, K.; Head-Gordon, T.; Zhao, H.
Large language models (LLMs) are in the ascendancy for research in drug discovery, offering unprecedented opportunities to reshape drug research by accelerating hypothesis generation, optimizing candidate prioritization, and enabling more scalable and cost-effective drug discovery pipelines. However, there is currently a lack of objective assessments of LLM performance to ascertain their advantages and limitations over traditional drug discovery platforms. To tackle this emergent problem, we have developed DrugPlayGround, a framework to evaluate and benchmark LLM performance for generating meaningful text-based descriptions of physicochemical drug characteristics, drug synergism, drug-protein interactions, and the physiological response to perturbations introduced by drug molecules. Moreover, DrugPlayGround is designed to work with domain experts to provide detailed explanations for justifying the predictions of LLMs, thereby testing LLMs for chemical and biological reasoning capabilities to push their greater use at the frontier of drug discovery at all of its stages.
Ulusoy, E.; Bostanci, S.; Deniz, B. E.; Dogan, T.
Motivation: Molecular representation learning is central to computational drug discovery. However, most existing models rely on single-modality inputs, such as molecular sequences or graphs, which capture only limited aspects of molecular behaviour. Yet unifying these modalities with complementary resources such as textual descriptions and biological interaction networks into a coherent multimodal framework remains non-trivial, hindering more informative and biologically grounded representations. Results: We introduce SELFormerMM, a multimodal molecular representation learning framework that integrates SELFIES notations with structural graphs, textual descriptions, and knowledge graph-derived biological interaction data. By aligning these heterogeneous views, SELFormerMM effectively captures complementary signals that unimodal approaches often overlook. Our performance evaluation has revealed that SELFormerMM outperforms structure-, sequence-, and knowledge-based models on multiple molecular property prediction tasks. Ablation analyses further indicate that effective cross-modal alignment and modality coverage improve the model's ability to exploit complementary information. Overall, integrating SELFIES with structural, textual, and biological context enables richer molecular representations and provides a promising framework for hypothesis-driven drug discovery. Availability: SELFormerMM is available as a programmatic tool, together with datasets, pretrained models, and precomputed embeddings at https://github.com/HUBioDataLab/SELFormerMM. Contact: tuncadogan@gmail.com
Quargnali, G.; Rivera-Fuentes, P.
Deep learning methods for protein structure generation, sequence design, and structure and property prediction have created unprecedented opportunities for protein engineering and drug discovery. However, using these tools often requires navigating incompatible software environments, diverse input/output formats, and high-performance computing infrastructure, any of which may hinder adoption by primarily experimental chemical biology laboratories. Here we present BioPipelines, an open-source Python framework that allows researchers to define multi-step computational design workflows in a few lines of code. Additionally, its robust yet modular architecture provides a straightforward way to expand the toolkit with different functionalities, particularly by leveraging coding agents, with little effort. The framework currently integrates over 30 tools encompassing structure generation, sequence design, structure prediction, compound screening, and analysis. The same workflow code can be prototyped interactively in a Jupyter notebook and then submitted for production-scale runs without modification. We demonstrate applications in inverse folding, gene synthesis, de novo protein design, compound library screening, iterative binding site optimization, and fusion-protein linker optimization. We hope this framework will empower researchers, allowing them to focus on the scientific question rather than computational logistics. BioPipelines is available under the MIT license at https://github.com/locbp-uzh/biopipelines.
Wu, R.; Mao, L.; Diao, Y.; Li, H.
Drafting Markush claims for chemical patents remains difficult because manual claim writing is slow, error-prone, and often fails to capture related chemical space in a systematic manner. We developed SpaceExpander, a computational method that converts disclosed compounds into generalized Markush claims by extracting core scaffolds, defining variable positions, decomposing complex substituents, and expanding substituent space through fragment matching. We evaluated the method on 24 publicly available chemical patents and compared its performance with IntelliPatent. SpaceExpander achieved a mean atom-level scaffold accuracy of 0.92 and exactly recovered the reference scaffold in 19 of 24 patents. By contrast, IntelliPatent could process only 2 patents from the same set, indicating more limited applicability to structurally diverse cases. We further examined practical claim coverage in a case study based on the Osimertinib patent. Using representative disclosed compounds as input, SpaceExpander drafted a Markush claim that covered 5 of 7 additional approved third-generation EGFR inhibitors beyond Osimertinib. These results show that SpaceExpander is a validated method for automated Markush claim drafting and chemical space expansion.
Yoo, J.; Shin, W.-H.
Motivation: Fragment-based drug discovery (FBDD) is an efficient strategy that leverages small molecular fragments to explore broader chemical space by combining them. Advances in computational methods have enabled the calculation of molecular properties and docking scores, thereby accelerating the development of algorithm- and AI-based approaches in FBDD. However, it should be noted that certain methods do not provide synthetic pathways to obtain the proposed compounds. Consequently, these molecules might not be synthesized easily. Results: In light of these developments, we propose MOZAIC, a novel framework that explores chemical space using reaction-based fragment growing and Conformational Space Annealing, a powerful global optimization algorithm. Our results show that MOZAIC effectively produces chemically diverse molecules with balanced improvements in lead-like properties, including QED, synthetic accessibility, and binding affinity. Furthermore, its flexible objective function allows fine-tuning for specific design goals, such as enhancing solubility with binding affinity. These capabilities position MOZAIC as a valuable platform for advancing fragment-to-lead and lead optimization efforts in drug discovery. Availability and implementation: MOZAIC is available at https://github.com/kucm-lsbi/MOZAIC/. Supplementary Information: Supplementary data are available at Bioinformatics online.
Nada, H.; Sipos-Szabo, L.; Bajusz, D.; Keseru, G.; Gabr, M.
Despite advances in computational drug discovery, de novo drug design remains hindered by high licensing costs and the need for specialized programming expertise. We present LigandForge, a webserver for structure-guided de novo ligand generation. LigandForge integrates structural validation and binding-site characterization; voxel-based property grid construction for spatial mapping of electrostatics and hydrophobicity; chemistry-aware fragment assembly; multi-objective lead optimization; and retrosynthetic feasibility analysis. The platform utilizes a structure-guided framework to assemble molecules from curated fragment libraries while enforcing physicochemical constraints, including molecular weight, LogP, and hybridization states. Generated molecules are refined via reinforcement learning and genetic algorithms and are subsequently evaluated using composite metrics such as the quantitative estimate of drug-likeness. By leveraging RDKit for cheminformatics and NGL viewer for real-time 3D visualization, LigandForge provides a synthesis-aware environment that bridges the gap between macromolecular structural data and experimentally feasible lead compounds without requiring local software installation.
Depenveiller, C.; Guerda, A.; Rabia, E.; Caidi, A.; Ashhab, Y.; Mami-Chouaib, F.; Montes, M.
Protein-peptide interactions underlie many cellular signaling and regulatory processes and are increasingly exploited in drug discovery. Characterizing such interfaces often requires the analysis of ensembles of conformations obtained by molecular modeling or molecular dynamics (MD) simulations, where transient contacts and alternative binding modes can be critical. Pharmacophore models provide an intuitive, transferable representation of molecular interactions. Dynophore (i.e., "dynamic pharmacophore") approaches have been developed for small-molecule ligands with MD information. We present PP-MAPS (Protein-Peptide Molecular dynamics Assisted Pharmacophore Signatures), an open-source workflow that extracts and aggregates pharmacophore interactions along MD trajectories of protein-peptide complexes. PP-MAPS produces per-residue interaction frequencies and pharmacophore heatmaps that facilitate comparison of peptides, binding sites and receptor variants. PP-MAPS is implemented in Python and is available under an open-source license at https://github.com/camilledepenveiller/PP-MAPS. The workflow relies on GROMACS for trajectory processing and can use either LigandScout or the Chemical Data Processing Toolkit (CDPKit) for pharmacophore feature detection.
Xie, L.; Ye, E.; Wang, H.; Zhang, T.; Zhen, Q.; Liang, F.; Liu, D.; Zhang, G.
Background: The function of a protein is intrinsically linked to its three-dimensional fold, and deep learning has revolutionized the field by enabling high-accuracy structure prediction at an unprecedented scale. Nevertheless, the growing deployment of these predictive pipelines in drug discovery and structural biology reveals a critical bottleneck that lies in the lack of independent and rigorous estimation of model accuracy (EMA) methodologies. Results: Here we present DeepUMQA-Global, a single-model deep learning framework for estimating accuracy of protein structure models. Our method employs a structure-sequence cross-consistency mechanism to evaluate the bidirectional compatibility between the predicted structure and the input sequence, enabling comprehensive characterization of fold accuracy. DeepUMQA-Global outperforms the self-assessment confidence scores of AlphaFold3, achieving improvements of 57.8% in Pearson correlation and 49.0% in Spearman correlation. With respect to the CASP16 retrospective benchmark, DeepUMQA-Global outperforms all single-model accuracy estimation methods that participated in CASP16 and achieves performance comparable to that of the top consensus-based methods. A lightweight consensus strategy built upon DeepUMQA-Global ranks first among all CASP16 participants, surpassing all other methods, including consensus approaches, and highlighting the strength of our method. Remarkably, DeepUMQA-Global demonstrates a strong ability to discriminate between alternative conformational states of proteins, as evidenced in the CASP unique alternative conformation protein complex target and the CoDNaS benchmark. Conclusions: Our results indicate that DeepUMQA-Global can be extended to broader protein modeling tasks, moving beyond static evaluation to offer a foundation for dynamic conformation EMA, where it accurately discriminates alternative conformational states and demonstrates reliable predictive fidelity in model accuracy estimation.
Teixeira, J. P.; Bajay, M. M.; Freire, C. C. d. M.; Bettin, L. B. F.; Soares, A. P.; de Lima Neto, D. F.
Zika virus (ZIKV), yellow fever virus (YFV), West Nile virus (WNV), Usutu virus (USUV), and Saint Louis encephalitis virus (SLEV) remain major public health concerns, yet broad-spectrum antiviral options are limited. Here, we present an open-source, reproducible software workflow for pocket-oriented virtual screening and ADME-integrated chemoinformatics, designed to support standardized multi-target compound prioritization. As a case study, the workflow was applied to structural and nonstructural proteins from clinically relevant flaviviruses. Automated pocket detection using Concavity reduces site-selection bias by generating docking boxes from surface concavity clusters, while standardized downstream scripts parse docking logs, convert docking-derived binding energies into Kd-related metrics, integrate SwissADME descriptors, and compute LE, LLE, FQ, and drug-likeness rules. The framework also supports retrospective validation and comparative benchmarking using literature-supported reference compounds and target-specific plausibility checks. Rather than proposing experimentally validated antiviral candidates, this study provides a reusable computational framework for hypothesis generation, benchmarking, and downstream experimental prioritization in structure-based drug discovery. The workflow is modular and adaptable to other multi-target screening campaigns where integrated ranking across binding, physicochemical, and ADME dimensions is required. SUMMARY: We describe an open-source, reproducible software workflow that integrates pocket-oriented docking, ligand efficiency scoring, ADME descriptor integration, and multivariate chemoinformatics to standardize compound prioritization across multiple protein targets. The workflow combines open-source tools with auditable Bash, R, and Python scripts and is demonstrated through a multi-target flavivirus case study.
Rather than claiming experimentally validated antiviral activity, the framework is intended to support hypothesis generation, retrospective benchmarking, transparent reporting, and downstream experimental prioritization.
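The abstract mentions converting docking-derived binding energies into Kd-related metrics and computing LE and LLE. The sketch below shows one common textbook way to do this, and is not taken from the workflow itself: it assumes ΔG = RT ln Kd with ΔG in kcal/mol, LE = -ΔG / (number of heavy atoms), and LLE = pKd - logP. The function names, the default temperature, and the example ligand values are assumptions for illustration; FQ is omitted because its definition is not given in the abstract.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)


def kd_from_dg(dg_kcal, temperature=298.15):
    """Estimated dissociation constant (mol/L) from a binding free energy."""
    return math.exp(dg_kcal / (R_KCAL * temperature))


def ligand_efficiency(dg_kcal, heavy_atoms):
    """LE: binding free energy per heavy atom (kcal/mol per atom)."""
    return -dg_kcal / heavy_atoms


def lipophilic_ligand_efficiency(dg_kcal, logp, temperature=298.15):
    """LLE: pKd minus logP, with pKd derived from the docking energy."""
    pkd = -math.log10(kd_from_dg(dg_kcal, temperature))
    return pkd - logp


# Hypothetical ligand: docking score -9.0 kcal/mol, 25 heavy atoms, logP 3.0.
dg, heavy, logp = -9.0, 25, 3.0
kd = kd_from_dg(dg)                            # roughly sub-micromolar
le = ligand_efficiency(dg, heavy)              # 0.36 kcal/mol per heavy atom
lle = lipophilic_ligand_efficiency(dg, logp)   # about 3.6
```

Docking scores are only rough proxies for binding free energy, so a Kd obtained this way is best treated as a ranking aid rather than a measured affinity.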
Kunnakkattu, I. R.; Choudhary, P.; Midlik, A.; Fleming, J. R.; Balasubramaniyan, B.; Sasidharan Nair, S.; Velankar, S.
Three-dimensional structures of protein-ligand complexes are essential for insights into the molecular principles that govern ligand recognition and binding. With more than 180,000 ligand-bound entries in the Protein Data Bank (PDB), representing over two million individual complexes, the volume of available structural data offers unprecedented opportunities for large-scale analysis of interaction patterns. Analysis of interaction patterns across the PDB archive can help discover similarities and differences in the binding modes of ligands, assisting in drug discovery. However, large-scale analysis of up-to-date information remains a significant challenge due to the rapid growth of data. Here, we introduce the Extended Connectivity Interaction Fingerprint (ECIFP), an interaction-based fingerprint that simplifies 3D protein-ligand contact information into a fingerprint, while retaining key molecular and chemical features of the interacting fragments. The simpler fingerprint representation of the interaction data makes comparison of millions of protein-ligand complexes tractable. Benchmarking shows that ECIFP outperforms ligand-only Extended Connectivity Fingerprints in identifying similar binding sites across identical protein sequences occupied by chemically diverse ligands. Our analysis showed that similarities calculated using ECIFP can be used to compare macromolecular complexes with similar or different ligands. In this study, we demonstrate two large-scale applications of ECIFP: (1) identification of distinct binding modes for over 9,000 ligands across the entire PDB, and (2) detection of binding-mode similarities among structurally diverse ligands within the same binding site across 48,870 binding sites from over 21,000 proteins.
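Comparing millions of complexes becomes tractable once each one is reduced to a fingerprint, because similarity is then a cheap set operation. The abstract does not state which similarity measure is used with ECIFP; the sketch below assumes the Tanimoto (Jaccard) coefficient, a common choice for binary fingerprints, and represents each fingerprint as a set of on-bit indices. The function name and the bit values are made up for illustration.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    a, b = set(fp_a), set(fp_b)
    union = len(a | b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / union


# Hypothetical on-bit indices for two protein-ligand interaction fingerprints.
complex_1 = {1, 5, 9, 12}
complex_2 = {1, 5, 7, 12, 20}
similarity = tanimoto(complex_1, complex_2)  # 3 shared bits / 6 distinct bits
```

Because the comparison only touches the on-bits, it stays fast for sparse fingerprints, which is what makes archive-scale pairwise analysis plausible.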
Dohi, E.
We screened a 5 receptor x 7 aptamer = 35-cell cross-target matrix with HADDOCK3 [1] under blind ambiguous-interaction-restraint (AIR) protocols on AlphaFold-modelled receptors. The screen surfaced 12 operationally distinct failure modes (collapsing to ~8 conceptual classes; §3.1). The K_D-calibration subset is n = 4 cells with literature K_D records under matched assay conditions; the broader cohort includes ≥ 6 biological cognate or intended-cognate cells. The principal case study is P01031 (complement C5, 1676 aa, ≥ 12 structural domains): all 7 panel members produced positive HADDOCK3 top-1 scores under a scale-adaptive AIR. Score-term decomposition locates the anomaly in the AIR term (+217 to +268 to top-1 score). With AIR zeroed, scores fall to -131 to -74 -- the small-receptor regime. Boltz-2 cofolding chain-pair ipTM (cpi_AB) is an independent channel: P01031 shows the lowest median cpi_AB (0.211; 0/7 above the 0.5 confident-interface threshold). To our knowledge, this is the first reported case study of a 1676 aa multi-domain receptor exhibiting this signature under blind scale-adaptive AIR -- an n = 1 mechanistic case, not a statistical generalisation. We adapt the QSAR applicability domain concept [14-16] to in silico aptamer screening. §3.7 reports an empirical Mode 1 mitigation (pLDDT-aware AIR prefilter; cohort Jaccard recovery ~10x).
Bai, J.; Prince, S.; Nitschke, G. S.
Recent deep learning models for L1000 chemical perturbation prediction incorporate dedicated drug molecular encoders. We retrained seven such models from scratch with zeroed or shuffled drug inputs, and compared them with a multilayer perceptron that uses only cell-line basal expression. Under drug-blind evaluation, ablation caused negligible performance changes and the drug-free baseline matched all models. Current architectures do not yet utilise drug molecular features for generalisation to unseen compounds.
Barsainyan, A. A.; Panda, R.; Siguenza, J.; Merico, D.; Ramsundar, B.
The problem of identifying which protein target a potential drug-like molecule interacts with is crucial for both the study of existing drugs and the design of new therapeutic compounds. Despite the importance of target identification, existing computational approaches remain limited in terms of speed, accuracy, and protein target coverage. We introduce ProteomeScan, a large-scale, gene-driven computational toolkit for systematic proteome-wide scanning to uncover hidden or previously uncharacterized protein-ligand interactions. ProteomeScan leverages cloud-scale high performance computing to perform extensive molecular docking simulations across the human proteome to rank candidate targets based on binding affinities. After filtering promiscuous targets, we found that ProteomeScan ranks known targets significantly better than a random baseline for a set of control compounds. Furthermore, we performed physical analyses of predicted binding modes for both promiscuous and known protein-ligand binding pairs to validate that ProteomeScan identifies interactions with valid binding pockets. In addition, we conducted experiments using mutant variants of proteins to study how mutations affect binding behavior. We have open sourced the core ProteomeScan algorithm as part of the DeepChem ecosystem to enhance transparency and reproducibility.
Roehrig, U. F.; Mathieu-Bugnon, M.; Zoete, V.
Motivation: Molecular docking is a pillar of structure-based drug design and shows advantages in structure prediction of small-molecule ligand-protein complexes over co-folding methods for novel ligands and novel binding pockets. Here, we describe substantial improvements of our physics-based docking algorithm Attracting Cavities, which is widely used through the SwissDock webserver. Results: AC 3.0 includes enhanced sampling features, new functionalities, and technical improvements. These lead to better sampling at lower execution times and higher versatility. Comparison with AutoDock Vina demonstrates better docking results on multiple test sets. Availability: AC 3.0 will be made available free of charge through the SwissDock webserver (www.swissdock.ch).
Wang, Y.; Rao, J.; Zhang, W.; Shi, Y.; Zeng, C.; Cui, R.; Wang, Y.; Xiong, J.; Li, X.; Zheng, M.
Accurate prediction of drug metabolites and enzyme selectivity is essential for rational drug design and safety assessment. However, existing computational approaches are often limited to specific enzyme families or reaction types, lacking the capacity to model enzyme-subtype specificity and prioritize major metabolites. Here, we present MetaReact, an end-to-end generalizable Transformer-based model that unifies the prediction of metabolic enzymes, metabolites, and sites of metabolism (SOM). By integrating the structure-aware encoding ReactSeq with chemical-reaction-based pretraining, MetaReact consistently outperforms state-of-the-art methods across multiple benchmarks under three settings: enzyme-agnostic, enzyme-completion, and enzyme-conditioned. Notably, it achieves 60% Top-3 accuracy in identifying major metabolites and superior CYP450 enzyme-subtype prediction and SOM recognition. Case studies validate its applicability to complex natural products, synthetic cannabinoids, and clinical candidates, facilitating toxicity assessment and molecular optimization. This scalable, rule-free solution advances human metabolism modeling, with potential for computational pharmacokinetics and early drug discovery.
Aldas-Bulos, V. D.; Plisson, F.
Machine learning continues to accelerate peptide and protein design through the rapid prediction and generation of sequences with desired characteristics. Many applications focus on predicting properties, functions, and structures, as well as generating point mutations and de novo designs. Nevertheless, many models prove less generalizable than initially claimed. Most predictors and generators are trained on sequential datasets, where imbalances can be addressed during preprocessing. In contrast, structural bias, a subtype of algorithmic bias arising from uneven representation of structural classes in training datasets, and the limitations of early protein structure predictors have frequently remained undetected and uncorrected. The recent surge in powerful protein structure prediction tools, such as the AlphaFold and RosettaFold series and their variants, now presents opportunities to mitigate this issue. We hypothesize that such structural sampling biases influence the downstream performance of ML models. Using antimicrobial peptides as a case study, we audited the structural biases in 16 state-of-the-art predictors for antimicrobial activity and tested whether structural information constrains their predictions. Our analysis revealed that models explicitly trained on sequential data still produce predictions biased by uneven fold representations and data leakage. These findings highlight the importance of integrating balanced structural data or implementing bias-mitigating strategies to develop agnostic models that maximize bioactive protein discovery and multi-objective optimization.
Lala, J.; Agrawal, H.; Dong, F.; Wells, J.; Angioletti-Uberti, S.
We present a general approach to find amino acid sequences corresponding to the most compact enzyme likely to retain the structure of a given catalytic site. Our approach is based on using Monte Carlo (MC) simulations to sample an energy landscape where minima correspond, by construction, to sequences with the aforementioned properties. Building on previous work (Wu et al., 2025) and with the BAGEL package (Lala et al., 2025), we implement a route to achieve this goal using only the information extracted from a protein language model (PLM), without structural information. After generating a set of candidate sequences with this PLM-guided BAGEL optimization, we further filter potential candidates for downstream experimental validation using a two-stage protocol. First, deep-learning-based structure prediction models (ESMFold, Chai-1, Boltz-2) are used to identify a structural consensus among designs with highly conserved active-site geometries, yielding many candidates with active-site RMSD below a few angstroms relative to the wild-type and pLDDT scores above 80. Second, molecular dynamics simulations are performed on a filtered subset of sequences (based on active-site RMSD and SolubleMPNN log-likelihoods) to evaluate active-site stability when including thermal fluctuations. For the most promising enzymes, these yield RMSF values in the active site below 1.0 Å and an active-site RMSD drift between 0.5 and 1.5 Å, making these mini-variants comparable to the wild type, though outcomes vary across enzymes. Given the protocol's generality, we believe these results represent a step forward in AI-guided enzyme design. To facilitate rapid experimental validation by the broader community, we open-source all sequences generated by our computational pipeline. These include designs for four representative enzymes of this study: PETase, subtilisin Carlsberg (serine protease), Taq DNA polymerase, and VioA.